Abstract. The best-known probability distribution of probabilities is the
Beta distribution, which applies when we have observed $r$ ‘successes’, each having a
probability $\theta$, and $n - r$ ‘failures’, each having a probability $1 - \theta$. In this paper we will
derive a whole family of Beta-like distributions, which take as their data not
only the number of successes and failures, but also values on predictor variables
and time to failure or time without failure.
Preface


The best-known probability distribution of probabilities is the Beta distribution.
If we have observed $r$ ‘successes’, each having a probability $\theta$, and $n - r$
‘failures’, each having a probability $1 - \theta$, then the corresponding Beta distribution
of $\theta$ is given as:

$$p(\theta \mid r, n) = \frac{(n-1)!}{(r-1)!\,(n-r-1)!}\, \theta^{r-1} (1-\theta)^{n-r-1}.$$
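As a quick numerical illustration of the Beta distribution above, the following Python sketch evaluates the density and checks that it integrates to one and has mean $r/n$. The function names and the data ($r = 3$, $n = 10$) are hypothetical choices of ours.

```python
import math

def beta_like_pdf(theta, r, n):
    """Beta density for r successes and n - r failures (requires 0 < r < n)."""
    norm = math.factorial(n - 1) / (math.factorial(r - 1) * math.factorial(n - r - 1))
    return norm * theta ** (r - 1) * (1.0 - theta) ** (n - r - 1)

def midpoint_integral(f, a, b, steps=20000):
    """Plain midpoint rule; adequate for the smooth densities used here."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

r, n = 3, 10  # hypothetical data: 3 successes, 7 failures
total = midpoint_integral(lambda th: beta_like_pdf(th, r, n), 0.0, 1.0)
mean = midpoint_integral(lambda th: th * beta_like_pdf(th, r, n), 0.0, 1.0)
print(total, mean)  # normalization and posterior mean r / n
```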

We will proceed in this paper to derive a whole family of Beta-like distributions, 
which take as their data not only the number of successes and failures, but also 
values on predictor variables and time to failure or time without failure. 

The recurring theme in all this will be that, apart from the ordinary product
and sum rules (e.g., Bayes’ theorem and the integrating out of nuisance parameters),
a change-of-variable or, alternatively, a Jacobian transformation allows us to map
the uncertainty we have regarding the unknown parameter(s), as captured in
the corresponding posterior, onto the probability of interest, thus allowing us to
construct a probability distribution of the probability of interest.

The Beta-Like distribution is the distribution that takes into account the
epistemological parameter uncertainty of a given probability model, as captured
in the posterior distribution of the parameters of that model. The Bayesian
model selection, also discussed in this paper, takes into account the epistemological
model uncertainty.¹

In this paper we will be talking about failure mechanisms which have some
underlying probabilistic ‘generating’ process. In light of this loose terminology, we


1 In the Bayesian view of probability theory there is no uncertainty other than epistemological,
seeing that a probability distribution over some set of propositions reflects our state of knowledge
regarding the plausibilities of these propositions, MS . For example, the coin has a mass, a center 
of gravity, a circumference, a width, etc. But it does not have the physical property: the
probability of heads or tails. And we know of at least one recorded instance where a coin landed
spinning on its side and remained standing on its side as its spinning subsided, until it came to a 
halt, while still standing on its side. 



would like to give, as a caveat, the following quote by Jaynes, who is considered by
many to be the father of modern Bayesianity, [5]:

[T]he judgment of a competent engineer, based on data of past 
experience in the field, represents information fully as ‘objective’ and 
reliable as anything we can possibly learn from a random experiment. 

Indeed, most engineers would make a stronger statement; since 
a random experiment is, by definition, one in which the outcome 
- and therefore the conclusion we draw from it - is subject to 
uncontrollable variations, it follows that the only fully ‘objective’ 
means of judging the reliability of a system is through analysis 
of stresses, rate of wear, etc., which avoids random experiments 
altogether. 

In practice, the real function of a reliability test is to check 
against the possibility of completely unexpected modes of failure; 
once a given failure mode is recognized and its mechanism understood,
no sane engineer would dream of judging its chances of
occurring merely from a random experiment. 

In closing, the probability models used in this paper are by no means exhaustive.
For example, we are currently studying the Negative Binomial probability model,
a probability model which is particularly popular among seismologists, and we
have already found that a lot of interesting things can be said about the
Negative Binomial generating process. But this will be the subject of another paper.


2 In this quote Jaynes answers the charge that only long term frequencies of random experiments 
may be considered ‘objective’. Bayesian probability theory was formulated in 1774 by the physicist 
Laplace, who used this probability theory to identify those problems in celestial mechanics where 
the data seemed to contradict the then current theory. This allowed Laplace to be highly productive 
in this field of science, so much so, that in his time he was called the French Newton. But shortly 
after his death Laplace’s probability theory was attacked by a school of pure mathematicians, who 
thought the definition of a probability as a state of knowledge to be lacking in rigor. Instead, they 
proposed that a probability should mean to be the ‘observed’ long term frequency of (an imaginary 
infinity of) random experiments. For a time this viewpoint dominated the field so completely that 
those who were students in the period 1930-1960 were hardly aware that any other conception had 
ever existed, {§]. 



CHAPTER 1 


Explicit Probability Distributions for Probabilities 


1. Predictors, the Logistic Regression Model. 

1.1. The Probability Model. In the logistic regression model, we model the
log-odds of some event by way of a regression model, say,

$$\log \frac{\theta}{1 - \theta} = \beta_0 + \beta_1 z, \tag{1.1}$$

where $z$ is some predictor value. Identity (1.1) implies that the probability of success
$\theta$ can be written down as the following function of the unknown parameters $\beta_0$
and $\beta_1$:

$$\theta = \frac{\exp(\beta_0 + \beta_1 z)}{1 + \exp(\beta_0 + \beta_1 z)}. \tag{1.2}$$

Likewise, the probability of a failure can be written down as:

$$1 - \theta = \frac{1}{1 + \exp(\beta_0 + \beta_1 z)}. \tag{1.3}$$

1.2. The Likelihood, Prior, and Posterior. Say, we have $r$ successes, with
corresponding predictor values $x_1, \ldots, x_r$, and $n - r$ failures, with corresponding
predictor values $y_1, \ldots, y_{n-r}$. Then, by way of (1.2) and (1.3), the probability of
the observed data, or, equivalently, the likelihood of the unknown parameters $\beta_0$
and $\beta_1$, can be written down as:

$$p(D \mid \beta_0, \beta_1) = L(\beta_0, \beta_1) = \prod_{i=1}^{r} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \prod_{j=1}^{n-r} \frac{1}{1 + \exp(\beta_0 + \beta_1 y_j)}. \tag{1.4}$$
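The likelihood (1.4) is usually evaluated on the log scale. A minimal Python sketch follows; the function names and the predictor values are hypothetical and serve only as an illustration.

```python
import math

def success_probability(beta0, beta1, z):
    """Equation (1.2), written as 1 / (1 + exp(-(beta0 + beta1 * z)))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * z)))

def log_likelihood(beta0, beta1, xs, ys):
    """Log of (1.4): xs are predictors of successes, ys of failures."""
    ll = 0.0
    for x in xs:
        ll += math.log(success_probability(beta0, beta1, x))
    for y in ys:
        ll += math.log(1.0 - success_probability(beta0, beta1, y))
    return ll

xs = [1.2, 2.0, 2.7]       # hypothetical predictor values of the r = 3 successes
ys = [0.1, 0.4, 0.9, 1.5]  # hypothetical predictor values of the n - r = 4 failures
ll_zero = log_likelihood(0.0, 0.0, xs, ys)  # every factor equals 1/2 here
print(ll_zero)
```

With both parameters set to zero every factor in (1.4) equals one half, which gives a simple sanity check on the implementation.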


We assign some uniform prior to the unknown parameters $\beta_0$ and $\beta_1$, as is
customary in Bayesian regression analysis, m- 


$$p(\beta_0, \beta_1) \propto \text{constant}, \tag{1.5}$$

where ‘$\propto$’ is the proportionality sign.


By multiplying the likelihood with the prior, respectively, (1.4) and (1.5), one
may obtain, by way of the product rule, or, equivalently, Bayes’ theorem, [6], the joint
distribution of $\beta_0$ and $\beta_1$:

$$p(D, \beta_0, \beta_1) \propto \prod_{i=1}^{r} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \prod_{j=1}^{n-r} \frac{1}{1 + \exp(\beta_0 + \beta_1 y_j)}. \tag{1.6}$$




Now, if we wish to obtain the probability distribution of the unknown parameters
$\beta_0$ and $\beta_1$, conditional on the data $D$, or, equivalently, the posterior of $\beta_0$ and $\beta_1$,
we must, by way of the product rule, [6], divide (1.6) by the evidence

$$p(D) = \iint p(D, \beta_0, \beta_1)\, d\beta_0\, d\beta_1 = C. \tag{1.7}$$

However, if we do this, then the inverse of the evidence, $C^{-1}$, being a constant not
dependent upon the parameters $\beta_0$ and $\beta_1$, gets absorbed in the proportionality
sign of (1.6), thus giving us a posterior:

$$p(\beta_0, \beta_1 \mid D) = \frac{p(D, \beta_0, \beta_1)}{p(D)} \propto \prod_{i=1}^{r} \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)} \prod_{j=1}^{n-r} \frac{1}{1 + \exp(\beta_0 + \beta_1 y_j)}, \tag{1.8}$$

which is proportional to both the likelihood (1.4) and the joint distribution (1.6).


However, we are not that much interested in the probability distribution of $\beta_0$
and $\beta_1$, since we are aiming for the probability distribution of the probability of a
success $\theta$, given a predictor value $z$, (1.2).

If we knew the values of $\beta_0$ and $\beta_1$ exactly, as we know our predictor value
$z$, then we could substitute these values directly into (1.2) and, thus, get the exact
probability $\theta$. Now, we do not know the values of $\beta_0$ and $\beta_1$ exactly. Instead, we
have a range of probable values on the $\beta_0$- and $\beta_1$-axes, as captured by the posterior
(1.8). This corresponds, through a two-to-one mapping, with a range of probable
values on the $\theta$-axis. This two-to-one mapping is, typically, accomplished by way of
a Jacobian transformation.

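1.3. The Jacobian Transformation. By way of (1.2), we have that

$$\theta = \frac{\exp(\beta_0 + \beta_1 z)}{1 + \exp(\beta_0 + \beta_1 z)}.$$

So, a possible transformation from $(\beta_0, \beta_1)$ to $(\theta, b_1)$ is

$$\beta_0 = \log \frac{\theta}{1 - \theta} - b_1 z, \qquad \beta_1 = b_1, \tag{1.9}$$

which has a corresponding Jacobian of

$$J = \begin{vmatrix} \dfrac{\partial \beta_0}{\partial \theta} & \dfrac{\partial \beta_0}{\partial b_1} \\[2ex] \dfrac{\partial \beta_1}{\partial \theta} & \dfrac{\partial \beta_1}{\partial b_1} \end{vmatrix} = \begin{vmatrix} \dfrac{1}{\theta (1 - \theta)} & -z \\[2ex] 0 & 1 \end{vmatrix} = \frac{1}{\theta (1 - \theta)}. \tag{1.10}$$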


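Substituting (1.9) into the posterior (1.8), and multiplying it with the Jacobian
(1.10), gives us the transformed posterior:

$$p(\theta, b_1 \mid z, D) \propto \frac{1}{\theta (1 - \theta)} \prod_{i=1}^{r} \frac{\frac{\theta}{1-\theta}\, e^{(x_i - z) b_1}}{1 + \frac{\theta}{1-\theta}\, e^{(x_i - z) b_1}} \prod_{j=1}^{n-r} \frac{1}{1 + \frac{\theta}{1-\theta}\, e^{(y_j - z) b_1}}. \tag{1.11}$$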
If we numerically integrate the unwanted parameter $b_1$ out of (1.11), we get the
posterior of the probability $\theta$, given the data $D$ and some predictor value $z$:

$$p(\theta \mid z, D) = \int p(\theta, b_1 \mid z, D)\, db_1, \tag{1.12}$$

which gives us the Bayesian logistic regression model we are looking for.
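The numerical integration in (1.12) can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are hypothetical, the infinite $b_1$ range is replaced by a finite grid, and a simple midpoint rule is used in place of a more careful quadrature.

```python
import math

def joint_posterior(theta, b1, z, xs, ys):
    """Unnormalized transformed posterior (1.11) on the (theta, b1) plane."""
    odds = theta / (1.0 - theta)
    value = 1.0 / (theta * (1.0 - theta))      # Jacobian (1.10)
    for x in xs:                               # successes
        w = odds * math.exp((x - z) * b1)
        value *= w / (1.0 + w)
    for y in ys:                               # failures
        w = odds * math.exp((y - z) * b1)
        value *= 1.0 / (1.0 + w)
    return value

def marginal_theta(theta, z, xs, ys, b_lo=-10.0, b_hi=10.0, steps=400):
    """Midpoint-rule version of (1.12); the finite b1 range is an assumption."""
    h = (b_hi - b_lo) / steps
    return h * sum(joint_posterior(theta, b_lo + (k + 0.5) * h, z, xs, ys)
                   for k in range(steps))

xs = [1.2, 2.0, 2.7]       # hypothetical predictors of the successes
ys = [0.1, 0.4, 0.9, 1.5]  # hypothetical predictors of the failures
grid = [(i + 0.5) / 200 for i in range(200)]
dens = [marginal_theta(th, 1.0, xs, ys) for th in grid]
norm = sum(dens) / 200
dens = [d / norm for d in dens]                # normalized on the theta grid
```

When all predictors equal $z$, `joint_posterior` no longer depends on `b1` and reduces to $\theta^{r-1}(1-\theta)^{n-r-1}$, the kernel of the ordinary Beta distribution.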

Note that for non-informative data, that is, for predictors which all have the
same value, $z = x_i = y_j$, for $i = 1, \ldots, r$ and $j = 1, \ldots, n - r$, the terms in
the exponentials of (1.11) all become zero, and, as a consequence, (1.12) collapses
to the ordinary Beta distribution:

$$p(\theta \mid z, D) = \frac{(n-1)!}{(r-1)!\,(n-r-1)!}\, \theta^{r-1} (1 - \theta)^{n-r-1}, \tag{1.13}$$


which is in nice correspondence with our intuition. 

If the predictors are non-informative, in that they ‘flat-line’, then the only 
pertinent aspect of the data D which remains, is the number of successes and 
failures, respectively, r and n — r, and these are just the sufficient statistics of the 
Beta distribution. 

2. Times to Failure and Times Without Failure, the Exponential Model 

2.1. The Probability Model. Say, we have an Exponential failure mechanism,
then the probability of a failure at time $t$ is

$$P(t \mid \lambda) = \lambda \exp(-\lambda t)\, dt. \tag{2.1}$$

Consequently, the probability of no failure until time $\tau$ is

$$P(t > \tau \mid \lambda) = \int_{\tau}^{\infty} \lambda \exp(-\lambda t)\, dt = \exp(-\lambda \tau). \tag{2.2}$$

In most reliability problems we will be interested in determining probability
(2.2). That is, in general we will wish to find the probability distribution of

$$\theta = \exp(-\lambda \tau), \tag{2.3}$$

where $\tau$ is some desired life-time and $\lambda$ is the unknown parameter of the Exponential
distribution.

2.2. The Likelihood, Prior, and Posterior. Say, we have $n$ identical units,
which we follow in time. If we observe a sequence of $r$ failure times, say, $x_1, \ldots, x_r$,
and $n - r$ units that did not fail, these consequently having times without
failure, say, $y_1, \ldots, y_{n-r}$, then, by way of (2.1) and (2.2), the probability of the
observed data, or, equivalently, the likelihood of the unknown parameter $\lambda$, can be
written down as

$$\begin{aligned} p(D \mid \lambda) = L(\lambda) &= \prod_{i=1}^{r} \lambda \exp(-\lambda x_i)\, dx_i \prod_{j=1}^{n-r} \exp(-\lambda y_j) \\ &\propto \prod_{i=1}^{r} \lambda \exp(-\lambda x_i) \prod_{j=1}^{n-r} \exp(-\lambda y_j), \end{aligned} \tag{2.4}$$

where we let the constant term $(dx_i)^r$ be absorbed in the proportionality sign.

It would not be strange if our prior information consisted of an initial guess
of a life-time of, say, $t$. This initial guess corresponds with a prior likelihood of

$$P(t \mid \lambda) = \lambda \exp(-\lambda t)\, dt. \tag{2.5}$$

Combining the prior likelihood (2.5) with the uninformative Jeffreys’ prior for the
inverse failure rate $\lambda$,

$$p(\lambda) \propto \frac{1}{\lambda}, \tag{2.6}$$

we get, by way of the product rule and the Bayesian proportionality short hand,
the informative prior of $\lambda$, based on the initial guess of a life-time of $t$:

$$p(\lambda \mid t) \propto \exp(-\lambda t), \tag{2.7}$$

where we have absorbed both the differential $dt$ of (2.5) and the normalizing constant
of (2.6) into the proportionality sign of (2.7).


Note that the prior (2.7) may also be obtained through an alternative maximum
entropy argument, [5]. We give the above derivation instead, because it is
analogous to the derivation of the informative prior for a postulated Weibull failure
mechanism, treated below.


Combining the likelihood with the informative prior, respectively, (2.4) and
(2.7), by way of the product rule and the Bayesian proportionality short hand, we
get the posterior for the unknown parameter $\lambda$:

$$p(\lambda \mid D, t) \propto \exp(-\lambda t) \prod_{i=1}^{r} \lambda \exp(-\lambda x_i) \prod_{j=1}^{n-r} \exp(-\lambda y_j). \tag{2.8}$$


The posterior (2.8) is the probability distribution of the unknown parameter $\lambda$,
conditional on the data $D$ we have observed and our tentative guess of a life-time
of $t$. However, we are not that much interested in the probability distribution of
$\lambda$. Rather, we are aiming for the probability distribution of the probability of the
life-time exceeding $\tau$, (2.3).

If we knew the value of $\lambda$ exactly, then we could substitute this value into
(2.3) and, so, get the exact probability $\theta$. Now, we do not know the value of $\lambda$
exactly. Instead, we have a range of probable values on the $\lambda$-axis, as captured by
the posterior (2.8). This corresponds, through a one-to-one mapping, with a range
of probable values on the $\theta$-axis.

This one-to-one mapping is, typically, accomplished by way of a change of
variable.


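2.3. The Change of Variable. By way of (2.3), we have that

$$\theta = \exp(-\lambda \tau).$$

So, the corresponding transformation is

$$\lambda = -\frac{\log \theta}{\tau}, \qquad d\lambda = \left| -\frac{d\theta}{\theta \tau} \right| = \frac{d\theta}{\theta \tau}. \tag{2.9}$$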


Substituting the change of variable (2.9) into the posterior (2.8), we obtain the
transformed posterior distribution of the probability $\theta$ that a given unit will have a
life-time exceeding $\tau$:

$$p[\theta \mid T(D,t), r, \tau] \propto \frac{(-\log \theta)^{r}}{\theta} \exp\left[ \frac{\log \theta}{\tau}\, T(D,t) \right], \tag{2.10}$$

where

$$T(D,t) = t + \sum_{i=1}^{r} x_i + \sum_{j=1}^{n-r} y_j, \tag{2.11}$$

is the total observed time without failure, $r$ is the number of observed failures, and
$\tau$ is the life-time that has to be exceeded.


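If we properly normalize (2.10), we get the identity:

$$p[\theta \mid T(D,t), r, \tau] = \left[ \frac{T(D,t)}{\tau} \right]^{r+1} \frac{(-\log \theta)^{r}}{r!\, \theta} \exp\left[ \frac{\log \theta}{\tau}\, T(D,t) \right]. \tag{2.12}$$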
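The normalization in (2.12) can be checked numerically. The Python sketch below uses hypothetical data (a prior guess $t = 100$, two failure times, three times without failure, and $\tau = 50$) and a plain midpoint rule; the function name is ours.

```python
import math

def exp_beta_like_pdf(theta, total_time, r, tau):
    """Normalized density (2.12) of theta = exp(-lambda * tau)."""
    a = total_time / tau
    log_pdf = ((r + 1) * math.log(a) + r * math.log(-math.log(theta))
               - math.lgamma(r + 1) + (a - 1.0) * math.log(theta))
    return math.exp(log_pdf)

t = 100.0                        # hypothetical prior guess of the life-time
xs = [30.0, 55.0]                # hypothetical failure times
ys = [80.0, 80.0, 80.0]          # hypothetical times without failure
T = t + sum(xs) + sum(ys)        # statistic (2.11)
r, tau = len(xs), 50.0

steps = 20000
h = 1.0 / steps
total = h * sum(exp_beta_like_pdf((i + 0.5) * h, T, r, tau) for i in range(steps))
print(total)  # should be close to one
```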


By way of (2.12), the expectation value of the probability that a given unit will
have a life-time that exceeds $\tau$ then is

$$E(\theta) = \int_0^1 \theta\, p[\theta \mid T(D,t), r, \tau]\, d\theta = \left[ \frac{T(D,t)}{T(D,t) + \tau} \right]^{r+1}. \tag{2.13}$$

This expectation value, which itself is a probability, is the result of Example 3, given
in [5]. However, there it was not yet recognized that this probability is the mean of
an underlying Beta-Like probability distribution.¹
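The closed form (2.13) can be cross-checked without the change of variable, by averaging $\exp(-\lambda\tau)$ directly over the posterior (2.8) of $\lambda$. The following Python sketch does so for hypothetical values of $T(D,t)$, $r$ and $\tau$; the truncation of the $\lambda$ range is an assumption of ours.

```python
import math

def expected_survival(total_time, r, tau):
    """Closed form (2.13): posterior mean of theta."""
    return (total_time / (total_time + tau)) ** (r + 1)

def expected_survival_numeric(total_time, r, tau, steps=20000):
    """Mean of theta = exp(-lambda * tau) under the posterior (2.8) of lambda."""
    upper = (r + 60.0) / total_time     # posterior mass beyond this is negligible
    h = upper / steps
    acc = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        # normalized posterior: T^(r+1) * lambda^r * exp(-lambda * T) / r!
        log_post = ((r + 1) * math.log(total_time) + r * math.log(lam)
                    - lam * total_time - math.lgamma(r + 1))
        acc += math.exp(-lam * tau + log_post)
    return acc * h

T, r, tau = 425.0, 2, 50.0   # hypothetical values of T(D,t), r and tau
closed = expected_survival(T, r, tau)
numeric = expected_survival_numeric(T, r, tau)
print(closed, numeric)
```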

Following Jaynes, we subject (2.13) to various extreme conditions, in order to
show the correspondence with the indications of common sense.


1 This, if anything, is a testament to the richness of Jaynes’ work. Even by working through the
most casual of his derivations, one may still be rewarded for one’s efforts by little gems, like the
Beta-Like distributions given here.






























If the total, say, unit-hours of the test is small compared to the prior expected
life-time $t$, that is, if $\sum x_i + \sum y_j \ll t$, then, (2.11),

$$T(D,t) \approx t,$$

and, unless a large number of failures $r$ is observed, our state of knowledge about $\theta$
can hardly be changed by the test, and, as a consequence, we have to rely on our
prior knowledge only.

If the total, say, unit-hours of the test is large compared to the prior expected
life-time $t$, that is, if $\sum x_i + \sum y_j \gg t$, then, (2.11),

$$T(D,t) \approx \sum_{i=1}^{r} x_i + \sum_{j=1}^{n-r} y_j,$$

and, for all intents and purposes, our final conclusions depend only on what we observed
in the test, and, as a consequence, these conclusions are almost independent of what
we thought previously.

In intermediate cases, our prior knowledge has a weight comparable to that of
the test. If $t \gg \tau$, the amount of testing required is appreciably reduced. For if we
were already quite sure that the units are satisfactory, then we require less additional
evidence before accepting them. But if $t \ll \tau$, that is, if we are initially very
doubtful about the units, then we demand that the test itself provide compelling
evidence in favor of their reliability.


3. Times to Failure and Times Without Failure, the Weibull Model

This is a repeat of the previous case, with the difference that we now use a 
Weibull failure mechanism instead of an Exponential one. 

3.1. The Probability Model. Say, we have a Weibull failure mechanism,
then the probability of a failure at time $t$ is

$$P(t \mid k, \lambda) = k \lambda (\lambda t)^{k-1} \exp\left[ -(\lambda t)^{k} \right] dt. \tag{3.1}$$

Consequently, the probability of no failure until time $\tau$ is

$$P(t > \tau \mid k, \lambda) = \int_{\tau}^{\infty} k \lambda (\lambda t)^{k-1} \exp\left[ -(\lambda t)^{k} \right] dt = \exp\left[ -(\lambda \tau)^{k} \right]. \tag{3.2}$$

In most reliability problems we will be interested in determining probability
(3.2). That is, in general we wish to find the probability distribution of

$$\theta = \exp\left[ -(\lambda \tau)^{k} \right], \tag{3.3}$$

where $\tau$ is some desired life-time, and $\lambda$ and $k$ are, respectively, the unknown inverse
failure rate and shape parameter of the Weibull distribution.
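The step from the failure density (3.1) to the probability (3.2) can be verified numerically. The Python sketch below integrates the density from $\tau$ upwards and compares the result with the closed form; the parameter values and the finite upper limit are hypothetical choices of ours.

```python
import math

def weibull_density(t, k, lam):
    """Failure density of (3.1), without the differential dt."""
    return k * lam * (lam * t) ** (k - 1) * math.exp(-((lam * t) ** k))

def weibull_survival(tau, k, lam):
    """Probability (3.2), equivalently theta in (3.3), of no failure until tau."""
    return math.exp(-((lam * tau) ** k))

k, lam, tau = 2.0, 0.01, 80.0       # hypothetical parameter values
upper, steps = 1000.0, 20000        # the density is negligible beyond `upper`
h = (upper - tau) / steps
numeric = h * sum(weibull_density(tau + (i + 0.5) * h, k, lam) for i in range(steps))
closed = weibull_survival(tau, k, lam)
print(numeric, closed)
```

For $k = 1$ the same functions reduce to the Exponential model of the previous section.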
3.2. The Likelihood, Prior, and Posterior. Say, we have $n$ identical units,
which we follow in time. If we observe a sequence of $r$ failure times, say, $x_1, \ldots, x_r$,
and $n - r$ units that did not fail, these consequently having times without
failure, say, $y_1, \ldots, y_{n-r}$, then, by way of (3.1) and (3.2), the probability of the
observed data, or, equivalently, the likelihood of the unknown parameters $\lambda$ and $k$,
can be written down as

$$\begin{aligned} p(D \mid k, \lambda) = L(k, \lambda) &= \prod_{i=1}^{r} k \lambda (\lambda x_i)^{k-1} \exp\left[ -(\lambda x_i)^{k} \right] dx_i \prod_{j=1}^{n-r} \exp\left[ -(\lambda y_j)^{k} \right] \\ &\propto \prod_{i=1}^{r} k \lambda (\lambda x_i)^{k-1} \exp\left[ -(\lambda x_i)^{k} \right] \prod_{j=1}^{n-r} \exp\left[ -(\lambda y_j)^{k} \right], \end{aligned} \tag{3.4}$$

where we let the constant term $(dx_i)^r$ be absorbed in the proportionality sign.

Say, our prior information consists of an initial guess of a life-time of, say, $t$.
This initial guess corresponds with a prior likelihood of

$$P(t \mid k, \lambda) = k \lambda (\lambda t)^{k-1} \exp\left[ -(\lambda t)^{k} \right] dt. \tag{3.5}$$


Combining the prior likelihood (3.5) with the uninformative Jeffreys’ prior for the
inverse failure rate and shape parameter, respectively, $\lambda$ and $k$,

$$p(k, \lambda) = p(k)\, p(\lambda) \propto \frac{1}{k} \frac{1}{\lambda}, \tag{3.6}$$

we get, by way of the product rule and the Bayesian proportionality short hand,
the informative prior of $\lambda$ and $k$, based on the initial guess of a life-time of $t$:

$$p(k, \lambda \mid t) \propto (\lambda t)^{k-1} \exp\left[ -(\lambda t)^{k} \right], \tag{3.7}$$

where we have absorbed both the differential $dt$ of (3.5) and the normalizing constant
of (3.6) into the proportionality sign of (3.7).

In pQ, an alternative informative prior is derived, where the piece of prior 
information consists of an initial guess of the time without failure, s*. Now, would 
we have no initial guess whatsoever, neither for the time to failure nor for the
time without failure, then the proper course of action would be to assign as an
uninformative prior the prior (3.6).


Combining the likelihood with the informative prior, respectively, (3.4) and
(3.7), by way of the product rule and the Bayesian proportionality short hand, we
get the posterior for the unknown parameters $\lambda$ and $k$:

$$p(k, \lambda \mid D, t) \propto (\lambda t)^{k-1} \exp\left[ -(\lambda t)^{k} \right] \prod_{i=1}^{r} k \lambda (\lambda x_i)^{k-1} \exp\left[ -(\lambda x_i)^{k} \right] \prod_{j=1}^{n-r} \exp\left[ -(\lambda y_j)^{k} \right]. \tag{3.8}$$


























The posterior (3.8) is the probability distribution of the unknown parameters
$\lambda$ and $k$, conditional on the data $D$ we have observed and our tentative guess
of a life-time of $t$. However, we are not that much interested in the probability
distribution of $\lambda$ and $k$. Rather, we are aiming for the probability distribution of
the probability of the life-time exceeding $\tau$, (3.3).

If we knew the values of $\lambda$ and $k$ exactly, then we could substitute these
values into (3.3) and, so, get the exact probability $\theta$. Now, we do not know the
values of $\lambda$ and $k$ exactly. Instead, we have a range of probable values on the $\lambda$- and
$k$-axes, as captured by the posterior (3.8). This corresponds, through a two-to-one
mapping, with a range of probable values on the $\theta$-axis.

This two-to-one mapping is, typically, accomplished by way of a Jacobian
transformation.


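3.3. The Jacobian Transformation. By way of (3.3), we have that

$$\theta = \exp\left[ -(\lambda \tau)^{k} \right].$$

So, a possible transformation is

$$\lambda = \frac{(-\log \theta)^{1/\kappa}}{\tau}, \qquad k = \kappa, \tag{3.9}$$

which has a corresponding Jacobian of

$$J = \left| \det \begin{pmatrix} \dfrac{\partial \lambda}{\partial \theta} & \dfrac{\partial \lambda}{\partial \kappa} \\[2ex] \dfrac{\partial k}{\partial \theta} & \dfrac{\partial k}{\partial \kappa} \end{pmatrix} \right| = \left| \det \begin{pmatrix} -\dfrac{(-\log \theta)^{1/\kappa - 1}}{\tau \kappa \theta} & \dfrac{\partial \lambda}{\partial \kappa} \\[2ex] 0 & 1 \end{pmatrix} \right| = \frac{(-\log \theta)^{1/\kappa - 1}}{\tau \kappa \theta}. \tag{3.10}$$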


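Substituting (3.9) into the posterior (3.8), and multiplying it with the Jacobian
(3.10), gives us the transformed posterior:

$$p[\kappa, \theta \mid T(D,t,\kappa), r, \tau] \propto \kappa^{r-1} \left( t \prod_{i=1}^{r} x_i \right)^{\kappa - 1} \tau^{-(r+1)\kappa}\, \frac{(-\log \theta)^{r}}{\theta} \exp\left[ \frac{\log \theta}{\tau^{\kappa}}\, T(D,t,\kappa) \right], \tag{3.11}$$

where $\kappa$ is the shape parameter of the Weibull distribution and where

$$T(D,t,\kappa) = t^{\kappa} + \sum_{i=1}^{r} x_i^{\kappa} + \sum_{j=1}^{n-r} y_j^{\kappa}, \tag{3.12}$$

is the total power-transformed observed time without failure, $r$ is the number of
observed failures, and $\tau$ is the life-time that has to be exceeded.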

Note that if we set the shape parameter to $\kappa = 1$, or, equivalently, we go from
the Weibull to the more restrictive Exponential failure mechanism, then, by way of
the proportionality sign, the posterior (3.11) collapses to (2.10), which is as it should
be.
Looking at the statistic (3.12), we see that for, say, $\kappa = 2$, one observation
having no failure until time $y = 7$ is equivalent to 49 observations having no failure
until time $y = 1$.

For $\kappa = 1$, where the Weibull collapses to the memoryless Exponential distribution,
one observation having no failure until time $y = 7$ is equivalent to 7
observations having no failure until time $y = 1$.

This reflects the Weibull’s dependence on the shape parameter $\kappa$. For large
values of $\kappa$, extended periods without failure become less probable. Thus, observing
one extended period without failure carries the same weight as observing many more
short periods without failure.
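The equivalence described above follows directly from the statistic (3.12) and can be reproduced in a few lines of Python. The function name is ours, and the prior guess is set to $t = 0$ purely to isolate the contribution of the times without failure.

```python
def power_transformed_time(t, xs, ys, kappa):
    """Statistic (3.12): t^kappa + sum of x_i^kappa + sum of y_j^kappa."""
    return t ** kappa + sum(x ** kappa for x in xs) + sum(y ** kappa for y in ys)

# kappa = 2: one unit surviving until y = 7 versus 49 units surviving until y = 1
one_long = power_transformed_time(0.0, [], [7.0], 2.0)
many_short = power_transformed_time(0.0, [], [1.0] * 49, 2.0)

# kappa = 1 (Exponential case): one unit until y = 7 versus 7 units until y = 1
one_long_exp = power_transformed_time(0.0, [], [7.0], 1.0)
many_short_exp = power_transformed_time(0.0, [], [1.0] * 7, 1.0)
print(one_long, many_short, one_long_exp, many_short_exp)
```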


If we numerically integrate the unwanted parameter $\kappa$ out of (3.11), we get
the posterior of the probability $\theta$, given the data $D$ and the initial guess of a
life-time, $t$:

$$p(\theta \mid \tau, D, t) = \int p[\kappa, \theta \mid T(D,t,\kappa), r, \tau]\, d\kappa. \tag{3.13}$$
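A minimal Python sketch of the numerical integration in (3.13) is given below. It assumes hypothetical data, works with the unnormalized form (3.11), and replaces the infinite $\kappa$ range by a finite grid with a midpoint rule; all of these are illustrative choices of ours.

```python
import math

def weibull_joint(kappa, theta, t, xs, ys, tau):
    """Unnormalized transformed posterior (3.11) on the (kappa, theta) plane."""
    r = len(xs)
    total = (t ** kappa + sum(x ** kappa for x in xs)
             + sum(y ** kappa for y in ys))                  # statistic (3.12)
    log_value = ((r - 1) * math.log(kappa)
                 + (kappa - 1.0) * (math.log(t) + sum(math.log(x) for x in xs))
                 - (r + 1) * kappa * math.log(tau)
                 + r * math.log(-math.log(theta)) - math.log(theta)
                 + math.log(theta) * total / tau ** kappa)
    return math.exp(log_value)

def weibull_marginal(theta, t, xs, ys, tau, k_lo=0.05, k_hi=10.0, steps=400):
    """Midpoint-rule version of (3.13); the finite kappa range is an assumption."""
    h = (k_hi - k_lo) / steps
    return h * sum(weibull_joint(k_lo + (i + 0.5) * h, theta, t, xs, ys, tau)
                   for i in range(steps))

# hypothetical prior guess, failure times, times without failure, and life-time
t, xs, ys, tau = 100.0, [30.0, 55.0], [80.0, 80.0, 80.0], 50.0
marginal = weibull_marginal(0.7, t, xs, ys, tau)
print(marginal)
```

At $\kappa = 1$ the function `weibull_joint` is proportional in $\theta$ to the Exponential result (2.10), which provides a convenient consistency check.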





CHAPTER 2 


Implicit Probability Distributions for 
Probabilities, Part I 


There are instances where we cannot rewrite any of the unknown parameters
in the posterior as a function of $\theta$, as was done, for example, in (1.9), (2.9), and
(3.9). This, then, prohibits us from finding the explicit form of the corresponding
Beta-like distribution.

However, we may still find the first moments of these intractable distributions,
thus allowing us to either approximate the corresponding probability distribution
or, alternatively, to construct confidence bounds on the estimated probabilities.


1. Times to m Events, the Poisson Model 


1.1. The Probability Model. Here we define the probability of interest, $\theta$,
to be the probability of $m$ events occurring within the time period $\tau$, by a mechanism
which is modeled by an underlying Exponential distribution, having an unknown
parameter $\lambda$.

The data consists of single events observed at variable, though consecutive,
waiting times $t_1, \ldots, t_m$, and a non-event from the last event, which is observed at
time $\sum_{j=1}^{m} t_j$, onwards until the end of the time period $\tau$:

$$\begin{aligned} P(m \mid \lambda, \tau) &= \int_0^{\tau} \cdots \int_0^{\tau - \sum_{j=1}^{m-1} t_j} \lambda e^{-\lambda t_1}\, \lambda e^{-\lambda t_2} \cdots \lambda e^{-\lambda t_m}\, e^{-\lambda \left( \tau - \sum_{j=1}^{m} t_j \right)}\, dt_m \cdots dt_1 \\ &= \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau). \end{aligned} \tag{1.1}$$


An inspection of (1.1) shows us that the probability of interest has the form of
a Poisson distribution, which has an expected number of events equal to $\lambda \tau$. So, we
define our probability of interest to be

$$\theta = \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau). \tag{1.2}$$

1.2. The Likelihood, Prior, and Posterior. In a total time period of, say,
$T$, we observe $n$ consecutive times to an event, $x_1, \ldots, x_n$. Assuming an Exponential
event-generating mechanism,

$$P(x_i \mid \lambda) = \lambda \exp(-\lambda x_i)\, dx_i, \tag{1.3}$$

the probability of the observed data, or, equivalently, the likelihood of the unknown
parameter $\lambda$, which is the expected number of events per time unit, may be written down as

$$\begin{aligned} P(D \mid \lambda) = L(\lambda) &= \exp\left[ -\lambda \left( T - \sum_{i=1}^{n} x_i \right) \right] \prod_{i=1}^{n} \lambda \exp(-\lambda x_i)\, dx_i \\ &\propto \lambda^{n} \exp(-\lambda T), \end{aligned} \tag{1.4}$$

where we have absorbed the term $(dx_i)^n$ into the proportionality sign.

If we have an initial guess that the time to an event is $t$, we may assign the
informative prior (2.7)

$$p(\lambda \mid t) \propto \exp(-\lambda t). \tag{1.5}$$

However, if we do not feel confident enough to make such a prior guess, we may
alternatively assign an uninformative Jeffreys prior

$$p(\lambda) \propto \frac{1}{\lambda}.$$

Multiplying the likelihood with the informative prior, respectively, (1.4) and
(1.5), we may obtain the posterior of $\lambda$:

$$p(\lambda \mid D, t) \propto \lambda^{n} \exp\left[ -\lambda (T + t) \right], \tag{1.6}$$


where the pertinent aspects of the data D are the number of events, n, and the 
total time of observation T. 

As an aside, if we have two data sets of the same phenomena under observation,
say, $D_1$ and $D_2$, having, respectively, $n_1$ and $n_2$ observed events in the respective
periods $T_1$ and $T_2$, then these data sets, together with the informative prior (1.5),
would combine to the posterior

$$p(\lambda \mid D_1, D_2, t) \propto \lambda^{n_1 + n_2} \exp\left[ -\lambda (T_1 + T_2 + t) \right].$$

Now, if we attempt to make a change of variable from $\lambda$ to $\theta$, then
we find that $\lambda$ cannot be written unambiguously as a function of $\theta$ for $m > 0$, where
$m = 0$ is equivalent to the Exponential probability model, (2.3). It follows that we
can make no analytical change of variable for the Poisson model probability (1.2),
or, equivalently, derive the explicit Beta-Like distribution for this probability model.

But what we can do is derive the first moments of this intractable Beta-Like
distribution. This will allow us to either compute the skewness-corrected confidence
bounds for this intractable distribution, or, alternatively, construct the MaxEnt
distribution which has the same moments as this intractable distribution, respectively,
[3] and [9].
1.3. MaxEnt Distributions. In what follows we give a short outline on how
to derive fourth-order MaxEnt distributions. Three well-known MaxEnt distributions
are the uniform, exponential, and normal distributions. These distributions
correspond, respectively, with zeroth-, first-, and second-order MaxEnt distributions.

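For a given probability distribution

$$p(\theta \mid \{\lambda\}),$$

where $\theta$ is the parameter of interest and $\{\lambda\}$ is some set of parameters which make
up this probability distribution, the first four cumulants are given as:

$$\begin{aligned} \mu &= \int \theta\, p(\theta \mid \{\lambda\})\, d\theta, \\ \sigma &= \sqrt{\int (\theta - \mu)^{2}\, p(\theta \mid \{\lambda\})\, d\theta}, \\ \gamma &= \frac{\int (\theta - \mu)^{3}\, p(\theta \mid \{\lambda\})\, d\theta}{\sigma^{3}}, \\ \kappa &= \frac{\int (\theta - \mu)^{4}\, p(\theta \mid \{\lambda\})\, d\theta}{\sigma^{4}}, \end{aligned} \tag{1.7}$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, $\gamma$ is the skewness, and $\kappa$ is the
kurtosis of the probability distribution $p(\theta \mid \{\lambda\})$.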

The fourth-order MaxEnt distribution incorporates information about the skewness
and kurtosis, respectively, $\gamma$ and $\kappa$, (1.7), as well as the mean and standard
deviation, respectively, $\mu$ and $\sigma$. The algorithm for higher-order MaxEnt distributions
is due to Rockinger and Jondeau, [9].

We will now proceed to give the algorithmic steps needed to construct fourth-order
MaxEnt approximations of intractable Beta-like distributions.


Step 1. 

The fourth-order MaxEnt distribution we seek takes as its input the first four 
cumulants of the probability distribution we wish to approximate. 








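For example, if we wish to determine the fourth-order MaxEnt distribution of $\theta$,
then we first compute the first four moments of $\theta$, (1.2), (1.6), and (1.7):

$$\begin{aligned} m_1 &= \int_0^{\infty} \left[ \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau) \right] \frac{(T+t)^{n+1}}{n!}\, \lambda^{n} \exp\left[ -\lambda (T + t) \right] d\lambda, \\ m_2 &= \int_0^{\infty} \left[ \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau) \right]^{2} \frac{(T+t)^{n+1}}{n!}\, \lambda^{n} \exp\left[ -\lambda (T + t) \right] d\lambda, \\ m_3 &= \int_0^{\infty} \left[ \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau) \right]^{3} \frac{(T+t)^{n+1}}{n!}\, \lambda^{n} \exp\left[ -\lambda (T + t) \right] d\lambda, \\ m_4 &= \int_0^{\infty} \left[ \frac{(\lambda \tau)^{m}}{m!} \exp(-\lambda \tau) \right]^{4} \frac{(T+t)^{n+1}}{n!}\, \lambda^{n} \exp\left[ -\lambda (T + t) \right] d\lambda. \end{aligned} \tag{1.8}$$

The moments in (1.8) evaluate to

$$\begin{aligned} m_1 &= \frac{(m + n)!}{m!\, n!} \left( \frac{\tau}{\tau + T + t} \right)^{m} \left( \frac{T + t}{\tau + T + t} \right)^{n+1}, \\ m_2 &= \frac{(2m + n)!}{(m!)^{2}\, n!} \left( \frac{\tau}{2\tau + T + t} \right)^{2m} \left( \frac{T + t}{2\tau + T + t} \right)^{n+1}, \\ m_3 &= \frac{(3m + n)!}{(m!)^{3}\, n!} \left( \frac{\tau}{3\tau + T + t} \right)^{3m} \left( \frac{T + t}{3\tau + T + t} \right)^{n+1}, \\ m_4 &= \frac{(4m + n)!}{(m!)^{4}\, n!} \left( \frac{\tau}{4\tau + T + t} \right)^{4m} \left( \frac{T + t}{4\tau + T + t} \right)^{n+1}. \end{aligned} \tag{1.9}$$

By way of (1.9) and the identities, [4],

$$\begin{aligned} \mu &= m_1, \\ \sigma &= \sqrt{m_2 - m_1^{2}}, \\ \gamma &= \frac{m_3 - 3 m_2 m_1 + 2 m_1^{3}}{\sigma^{3}}, \\ \kappa &= \frac{m_4 - 4 m_3 m_1 + 6 m_2 m_1^{2} - 3 m_1^{4}}{\sigma^{4}}, \end{aligned} \tag{1.10}$$

we may then compute the first four cumulants, needed for the construction of the
fourth-order MaxEnt distribution.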
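Step 1 can be sketched in Python as follows. The closed-form moments (1.9) are compared against a direct midpoint-rule evaluation of the integrals (1.8), after which the identities (1.10) are applied. The values of $m$, $n$, $\tau$ and $T + t$ are hypothetical, and the truncation of the $\lambda$ range is an assumption of ours.

```python
import math

def poisson_theta(lam, tau, m):
    """Probability (1.2) of m events within the period tau."""
    return (lam * tau) ** m / math.factorial(m) * math.exp(-lam * tau)

def moment_closed_form(k, m, n, tau, total):
    """k-th moment from (1.9); `total` stands for T + t."""
    log_coef = math.lgamma(k * m + n + 1) - k * math.lgamma(m + 1) - math.lgamma(n + 1)
    return (math.exp(log_coef) * (tau / (k * tau + total)) ** (k * m)
            * (total / (k * tau + total)) ** (n + 1))

def moment_numeric(k, m, n, tau, total, steps=20000):
    """k-th moment from (1.8), midpoint rule over the normalized posterior (1.6)."""
    upper = (n + 60.0) / total          # posterior mass beyond this is negligible
    h = upper / steps
    acc = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        log_post = ((n + 1) * math.log(total) + n * math.log(lam)
                    - lam * total - math.lgamma(n + 1))
        acc += poisson_theta(lam, tau, m) ** k * math.exp(log_post)
    return acc * h

def cumulants(m1, m2, m3, m4):
    """Identities (1.10): mean, standard deviation, skewness, kurtosis."""
    sigma = math.sqrt(m2 - m1 ** 2)
    gamma = (m3 - 3 * m2 * m1 + 2 * m1 ** 3) / sigma ** 3
    kappa = (m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4) / sigma ** 4
    return m1, sigma, gamma, kappa

m, n, tau, total = 2, 12, 5.0, 120.0    # hypothetical m, n, tau and T + t
closed = [moment_closed_form(k, m, n, tau, total) for k in (1, 2, 3, 4)]
numeric = [moment_numeric(k, m, n, tau, total) for k in (1, 2, 3, 4)]
mu, sigma, gamma, kappa = cumulants(*closed)
print(closed, numeric, (mu, sigma, gamma, kappa))
```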


Step 2. 

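We now plug the third and fourth cumulant, respectively, $\gamma$ and $\kappa$, into the integral
function

$$Q(\varphi_1, \varphi_2, \varphi_3, \varphi_4) = \int_a^b \exp\left[ \varphi_1 x + \varphi_2 \left( x^{2} - 1 \right) + \varphi_3 \left( x^{3} - \gamma \right) + \varphi_4 \left( x^{4} - \kappa \right) \right] dx, \tag{1.11}$$

wherein

$$a = -\frac{\mu}{\sigma}, \qquad b = \frac{1 - \mu}{\sigma}. \tag{1.12}$$

Then minimize (1.11) over the vector $(\varphi_1, \varphi_2, \varphi_3, \varphi_4)$ in order to obtain the
minimization estimates $(\hat{\varphi}_1, \hat{\varphi}_2, \hat{\varphi}_3, \hat{\varphi}_4)$.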

Step 3. 

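We then make a change of variable from $x$ to $\theta$, where

$$x = \frac{\theta - \mu}{\sigma}, \tag{1.13}$$

in order to obtain the unscaled MaxEnt distribution on the $\theta$ axis:

$$p(\theta \mid \mu, \sigma, \gamma, \kappa) \propto \frac{1}{\sigma} \exp\left[ \hat{\varphi}_1 \frac{\theta - \mu}{\sigma} + \hat{\varphi}_2 \left( \frac{\theta - \mu}{\sigma} \right)^{2} + \hat{\varphi}_3 \left( \frac{\theta - \mu}{\sigma} \right)^{3} + \hat{\varphi}_4 \left( \frac{\theta - \mu}{\sigma} \right)^{4} \right]. \tag{1.14}$$

The normalizing constant of (1.14) then may be computed by way of the integral,
(1.12) and (1.13),

$$C = \int_0^1 p(\theta \mid \mu, \sigma, \gamma, \kappa)\, d\theta, \tag{1.15}$$

where the integrand is the unscaled distribution (1.14).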

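The properly normalized fourth-order MaxEnt approximation of the intractable
Beta-like distribution, which has (1.2) as its probability model, is then given as, (1.9),
(1.10), (1.11), (1.14), and (1.15):

$$p(\theta \mid \mu, \sigma, \gamma, \kappa) = \frac{1}{C \sigma} \exp\left[ \hat{\varphi}_1 \frac{\theta - \mu}{\sigma} + \hat{\varphi}_2 \left( \frac{\theta - \mu}{\sigma} \right)^{2} + \hat{\varphi}_3 \left( \frac{\theta - \mu}{\sigma} \right)^{3} + \hat{\varphi}_4 \left( \frac{\theta - \mu}{\sigma} \right)^{4} \right], \tag{1.16}$$

where $0 < \theta < 1$.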

2. Times to m Events, the Cumulative Poisson Model 

2.1. The Probability Model. In the preceding discussion we discussed the
Beta-Like distribution of the probability of $m$ events occurring within the time period
$\tau$, by a mechanism which is modeled by an underlying Exponential distribution,
having an unknown parameter $\lambda$. This probability distribution may be of interest
if we have a parallel system of $m$ non-redundant fail-safe mechanisms, where each
mechanism admits an Exponential time to failure model.

Now, we can envisage scenarios in which we wish to determine the Beta-Like
distribution of the probability of more than $m$ events occurring within the time period
$\tau$, by a mechanism which is modeled by an underlying Exponential distribution,
having an unknown parameter $\lambda$. One such scenario may be where we have a system
which is subject to successive loads, each load admitting an Exponential time to
occurrence model. The system then might be hypothesized to be able to withstand up to $m$ such
loads, before a significant wear and tear occurs.

1 The limits of integration are determined by the identities:

$$\mu + a \sigma = 0, \qquad \mu + b \sigma = 1.$$

For this scenario the probability model of interest is

$$\theta = 1 - \sum_{i=0}^{m} \frac{(\lambda \tau)^{i}}{i!} \exp(-\lambda \tau). \tag{2.1}$$

Now, (2.1) is just a probability, just like, say, (1.2) is, which admits an un¬ 
certainty regarding the actual value of the inverse failure rate A, as captured by 


the posterior (1.6). So, we may proceed to compute the first four moments of 
the Beta-Like probability distribution which results from the uncertainty we have 
regarding the actual value of A: 

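$$
\begin{aligned}
m_1 &= \int_0^\infty \left[ 1 - \sum_{i=0}^{m} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) \right] \frac{(T+t)^{n+1}}{n!} \lambda^n \exp[-\lambda(T+t)] \, d\lambda, \\
m_2 &= \int_0^\infty \left[ 1 - \sum_{i=0}^{m} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) \right]^2 \frac{(T+t)^{n+1}}{n!} \lambda^n \exp[-\lambda(T+t)] \, d\lambda, \\
m_3 &= \int_0^\infty \left[ 1 - \sum_{i=0}^{m} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) \right]^3 \frac{(T+t)^{n+1}}{n!} \lambda^n \exp[-\lambda(T+t)] \, d\lambda, \\
m_4 &= \int_0^\infty \left[ 1 - \sum_{i=0}^{m} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) \right]^4 \frac{(T+t)^{n+1}}{n!} \lambda^n \exp[-\lambda(T+t)] \, d\lambda.
\end{aligned} \tag{2.2}
$$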


Having obtained these moments we may compute the relevant cumulants, by way of
(1.10), and proceed to construct the fourth-order MaxEnt approximation (1.16) of
the intractable Beta-Like distribution.
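The moment integrals in (2.2) have no convenient closed form for general $m$, but they are one-dimensional and easy to evaluate numerically. The following is a minimal sketch, assuming that the posterior (1.6) is the normalized density proportional to $\lambda^n \exp[-\lambda(T+t)]$, as the integrand of (2.2) suggests. The function names, the argument `rate` (standing for $T + t$), and the midpoint rule with its truncation point are illustrative choices, not part of the original derivation.

```python
import math

def theta(lam, tau, m):
    """Probability (2.1) of more than m events in a period tau at rate lam."""
    x = lam * tau
    term, head = math.exp(-x), 0.0
    for i in range(m + 1):
        head += term              # adds (x**i / i!) * exp(-x)
        term *= x / (i + 1)
    return 1.0 - head

def log_gamma_posterior(lam, n, rate):
    """Log of the normalized density proportional to lam**n * exp(-rate * lam)."""
    return ((n + 1) * math.log(rate) - math.lgamma(n + 1)
            + n * math.log(lam) - rate * lam)

def theta_moments(tau, m, n, rate, steps=20000):
    """First four raw moments of theta, by midpoint integration over lam."""
    mean = (n + 1) / rate
    sd = math.sqrt(n + 1) / rate
    upper = mean + 12.0 * sd      # assumed truncation; the tail beyond is negligible
    h = upper / steps
    moments = [0.0, 0.0, 0.0, 0.0]
    for s in range(steps):
        lam = (s + 0.5) * h
        weight = math.exp(log_gamma_posterior(lam, n, rate)) * h
        th = theta(lam, tau, m)
        for j in range(4):
            moments[j] += th ** (j + 1) * weight
    return moments
```

For $m = 0$ the first moment can be checked against the closed form $1 - [\mathrm{rate}/(\mathrm{rate} + \tau)]^{n+1}$, which is a useful sanity test before feeding the moments into the cumulant formulas.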


3. Predictors, the Poisson Regression Model 

3.1. The Probability Model. In the Poisson regression model the number
of events, $m$, occurring in a given time period, $\tau$, has a Poisson distribution:

$$
P(m \mid \lambda, \tau) = \frac{(\lambda\tau)^m}{m!} \exp(-\lambda\tau), \tag{3.1}
$$

where the logarithm of the expected number of events per time unit, $\lambda$, is modeled
by way of a regression model:

$$
\log \lambda = \beta_0 + \beta_1 x. \tag{3.2}
$$

If we take the exponential of (3.2), and substitute it into (3.1), we get the probability
model:

$$
\begin{aligned}
\theta &= \frac{[\exp(\beta_0 + \beta_1 x)\, \tau]^m}{m!} \exp[-\exp(\beta_0 + \beta_1 x)\, \tau] \\
&= \frac{\tau^m}{m!} \exp[m(\beta_0 + \beta_1 x) - \exp(\beta_0 + \beta_1 x)\, \tau].
\end{aligned} \tag{3.3}
$$


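3.2. The Likelihood, Prior, and Posterior. The data $D$ consists of $n$
counts, $r_1, \ldots, r_n$, with corresponding predictor values $x_1, \ldots, x_n$. Using (3.3), the
probability of the data $D$, or, equivalently, the likelihood of the unknown parameters
$\beta_0$ and $\beta_1$, may be written down as

$$
\begin{aligned}
p(D \mid \beta_0, \beta_1) &= \prod_{i=1}^{n} \frac{\tau^{r_i}}{r_i!} \exp[r_i(\beta_0 + \beta_1 x_i) - \exp(\beta_0 + \beta_1 x_i)\, \tau] \\
&\propto \prod_{i=1}^{n} \exp[r_i(\beta_0 + \beta_1 x_i) - \exp(\beta_0 + \beta_1 x_i)\, \tau] \\
&\propto \exp\left[ \sum_i r_i(\beta_0 + \beta_1 x_i) - \tau \sum_i \exp(\beta_0 + \beta_1 x_i) \right].
\end{aligned} \tag{3.4}
$$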

We assign some uniform prior to the unknown regression parameters $\beta_0$ and $\beta_1$:

$$
p(\beta_0, \beta_1) \propto \text{constant}. \tag{3.5}
$$


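By multiplying the likelihood (3.4) with the prior (3.5), one may obtain the
unscaled posterior of $\beta_0$ and $\beta_1$:

$$
p(\beta_0, \beta_1 \mid D) \propto p(\beta_0, \beta_1)\, p(D \mid \beta_0, \beta_1) \propto \exp\left[ \sum_i r_i(\beta_0 + \beta_1 x_i) - \tau \sum_i \exp(\beta_0 + \beta_1 x_i) \right]. \tag{3.6}
$$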


Since we can make no analytical Jacobian transformation from $\beta_0$ and $\beta_1$
to the probability model $\theta$, we compute the first moments of (3.3), by way of the
posterior (3.6), and proceed to construct either the skewness corrected confidence
interval or the MaxEnt approximation of the corresponding Beta-Like distribution.
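Because the posterior of the regression parameters is only two-dimensional, the moments of $\theta$ can be approximated on a simple grid. The sketch below assumes the unscaled posterior $\exp[\sum_i r_i(\beta_0 + \beta_1 x_i) - \tau \sum_i \exp(\beta_0 + \beta_1 x_i)]$ under a uniform prior and evaluates $\theta$ at a hypothetical new predictor value `x_new`. The grid bounds `b0_range` and `b1_range` are assumptions that truncate the uniform prior and must be chosen wide enough to cover the posterior mass; all names are illustrative.

```python
import math

def log_posterior(b0, b1, counts, xs, tau):
    """Unscaled log posterior of (b0, b1) under a uniform prior."""
    return sum(r * (b0 + b1 * x) - tau * math.exp(b0 + b1 * x)
               for r, x in zip(counts, xs))

def theta(b0, b1, x_new, m, tau):
    """Poisson regression probability of m events in tau at predictor x_new."""
    log_rate = b0 + b1 * x_new
    return math.exp(m * (log_rate + math.log(tau)) - math.lgamma(m + 1)
                    - math.exp(log_rate) * tau)

def theta_moments(counts, xs, tau, x_new, m, b0_range, b1_range, steps=120):
    """First four raw moments of theta on a (b0, b1) grid of cell centers."""
    h0 = (b0_range[1] - b0_range[0]) / steps
    h1 = (b1_range[1] - b1_range[0]) / steps
    cells = [(b0_range[0] + (i + 0.5) * h0, b1_range[0] + (j + 0.5) * h1)
             for i in range(steps) for j in range(steps)]
    logs = [log_posterior(b0, b1, counts, xs, tau) for b0, b1 in cells]
    top = max(logs)                      # subtract the maximum to avoid overflow
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    moments = [0.0, 0.0, 0.0, 0.0]
    for (b0, b1), w in zip(cells, weights):
        th = theta(b0, b1, x_new, m, tau)
        for j in range(4):
            moments[j] += th ** (j + 1) * w / norm
    return moments
```

The returned raw moments can then be converted to cumulants and passed to whichever approximation (confidence interval or MaxEnt) is preferred.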















CHAPTER 3 


Implicit Probability Distributions for 
Probabilities, Part II 


We will here construct the Beta-Like distribution of a Poisson-Like probability
model. We define the probability of interest, $\theta$, to be the probability of $m$ events
occurring within the time period $\tau$, by a mechanism which is modeled by an
underlying Weibull distribution having unknown parameters $\lambda$ and $k$.

The advantage of a Weibull over an Exponential mechanism is that the shape
parameter $k$ of the former represents an extra degree of freedom, as it may take on
any value greater than zero, whereas in the latter it is dogmatically set to one.

Regular Poisson distributions have as their event-generating mechanism
Exponential distributions, (1.1). In contrast, a sequence of Weibull distributed events
leaves us with an analytically intractable integral.

However, making use of an unexpected equivalence relationship, we may work
around the encountered integral and, so, proceed to approximate the Poisson-Like
distribution.


1. The Issue 


We first present the case for the regular Poisson distribution, as this will give
us a handle on how to generalize from the Poisson to the Poisson-Like distribution.

Let $\theta$ be the probability of $m$ failures occurring within the time period $\tau$. The
failure mechanism generating these events is the Exponential distribution, having
an unknown parameter $\lambda$. The data we will use are failures at variable, though
consecutive, times $t_1, \ldots, t_m$, and a non-failure from the last failure, at time $t_m$,
onwards, until the end of the time period $\tau$, (1.1):

$$
\begin{aligned}
\theta &= \int_0^{\tau} \int_0^{\tau - t_1} \cdots \int_0^{\tau - \sum_{j=1}^{m-1} t_j} \lambda e^{-\lambda t_1} \lambda e^{-\lambda t_2} \cdots \lambda e^{-\lambda t_m} e^{-\lambda \left( \tau - \sum_{i=1}^{m} t_i \right)} \, dt_m \cdots dt_2 \, dt_1 \\
&= \lambda^m e^{-\lambda\tau} \int_0^{\tau} \int_0^{\tau - t_1} \cdots \int_0^{\tau - \sum_{j=1}^{m-1} t_j} dt_m \cdots dt_2 \, dt_1,
\end{aligned} \tag{1.1}
$$

where it may be found, by way of induction, that

$$
\int_0^{\tau} \int_0^{\tau - t_1} \cdots \int_0^{\tau - \sum_{j=1}^{m-1} t_j} dt_m \cdots dt_2 \, dt_1 = \frac{\tau^m}{m!}. \tag{1.2}
$$

Substituting (1.2) into (1.1), we find that $\theta$ is just the traditional Poisson
probability of $m$ events occurring in a time period $\tau$,

$$
\theta = \frac{(\lambda\tau)^m}{m!} \exp(-\lambda\tau), \tag{1.3}
$$

where $\lambda\tau$ is the expected number of failures within the time period $\tau$.
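The claim that consecutive Exponential waiting times lead to the Poisson probability of $m$ events in a period $\tau$ can be checked by simulation. The following is a small Monte Carlo sketch, not part of the original derivation: it draws Exponential waiting times, counts how many events fit inside the period, and compares the relative frequency with the Poisson formula. Function names, the number of draws, and the seed are illustrative.

```python
import math
import random

def poisson_probability(m, lam, tau):
    """Poisson probability of m events in a period tau at rate lam."""
    return math.exp(m * math.log(lam * tau) - math.lgamma(m + 1) - lam * tau)

def simulate_event_count(lam, tau, rng):
    """Count the Exponential(lam) waiting times that fit inside the period tau."""
    elapsed, count = rng.expovariate(lam), 0
    while elapsed <= tau:
        count += 1
        elapsed += rng.expovariate(lam)
    return count

def estimate_probability(m, lam, tau, draws=100000, seed=1):
    """Relative frequency of exactly m events over repeated simulated periods."""
    rng = random.Random(seed)
    hits = sum(simulate_event_count(lam, tau, rng) == m for _ in range(draws))
    return hits / draws
```

The same simulation scheme carries over to other waiting time distributions, which is what makes it a convenient cross-check once the Exponential mechanism is replaced by a Weibull one.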

We again define $\theta$ to be the probability of $m$ failures occurring within the time
period $\tau$, but now by a failure mechanism which is modelled by an underlying Weibull
distribution having parameters $\lambda$ and $k$.

The Weibull model, having one more parameter, is more flexible than the
Exponential model. In fact, the Exponential is a special case of the Weibull, where
we set the shape parameter $k$ to one.

The data, again, consists of failures at variable, though consecutive, times
$t_1, \ldots, t_m$, and a non-failure from the last failure, at time $t_m$, onwards, until the
end of the time period $\tau$:


$$
\begin{aligned}
\theta = \int_0^{\tau} \int_0^{\tau - t_1} \cdots \int_0^{\tau - \sum_{j=1}^{m-1} t_j} & k\lambda (\lambda t_1)^{k-1} e^{-(\lambda t_1)^k} \, k\lambda (\lambda t_2)^{k-1} e^{-(\lambda t_2)^k} \cdots \\
& \cdots k\lambda (\lambda t_m)^{k-1} e^{-(\lambda t_m)^k} \, e^{-\left[ \lambda \left( \tau - \sum_{j=1}^{m} t_j \right) \right]^k} \, dt_m \cdots dt_2 \, dt_1.
\end{aligned} \tag{1.4}
$$

Integral (1.4) does not allow for a simple analytical expression like (1.3). So, by
way of the curse of dimensionality, as $m \gg 1$, we are prohibited from evaluating the
first cumulants of (1.4) and, as a consequence, from constructing either a confidence bound
or a MaxEnt approximative distribution. However, there is a useful equivalence
relation which may be used to find these cumulants after all.


2. The Equivalence Relation 


The equivalence relationship is derived first for the regular Poisson model (1.1),
since only for this regular case do we have the analytical solution of the target
integral with which we can compare the alternative route.

Let $q$ be the sum of $n + 1$ waiting times, which are generated by independent
Exponential processes:

$$
q = t_1 + \cdots + t_{n+1}, \qquad t_i \sim \mathrm{Exp}(\lambda). \tag{2.1}
$$

Then we may derive the probability density function of the stochastic $q$ from the
product of $n + 1$ Exponential distributions,

$$
p(t_1, \ldots, t_{n+1} \mid \lambda) = \prod_{i=1}^{n+1} \lambda \exp(-\lambda t_i), \tag{2.2}
$$

and some appropriate Jacobian transformation like, for example,

$$
\begin{aligned}
t_1 &= q - t'_2 - \cdots - t'_{n+1}, \\
t_2 &= t'_2, \\
&\;\;\vdots \\
t_{n+1} &= t'_{n+1}.
\end{aligned} \tag{2.3}
$$

By way of (2.2) and (2.3), while keeping track of the appropriate integration
limits of the $n$ unwanted parameters, $t'_2, \ldots, t'_{n+1}$, we may find the probability
distribution of $q$:

$$
\begin{aligned}
p(q \mid \lambda) &= \lambda^{n+1} \exp(-\lambda q) \int_0^{q} \int_0^{q - t'_2} \cdots \int_0^{q - \sum_{j=2}^{n} t'_j} dt'_{n+1} \cdots dt'_3 \, dt'_2 \\
&= \lambda \frac{(\lambda q)^n}{n!} \exp(-\lambda q).
\end{aligned} \tag{2.4}
$$


Now, as it turns out, the cumulative distribution function of $q$, that is, the sum
of $n + 1$ Exponential waiting times,

$$
P(q < \tau \mid \lambda) = \int_0^{\tau} \lambda \frac{(\lambda q)^n}{n!} \exp(-\lambda q) \, dq, \tag{2.5}
$$

is equivalent to the probability of observing more than $n$ events in the time period
$\tau$,

$$
P(i > n \mid \lambda, \tau) = 1 - \sum_{i=0}^{n} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau). \tag{2.6}
$$


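So, the equivalence relationship we will make use of is, (2.5) and (2.6), see also (2.1),

$$
\int_0^{\tau} \lambda \frac{(\lambda q)^n}{n!} \exp(-\lambda q) \, dq = 1 - \sum_{i=0}^{n} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau), \tag{2.7}
$$

or, equivalently,

$$
1 - \int_0^{\tau} \lambda \frac{(\lambda q)^n}{n!} \exp(-\lambda q) \, dq = \sum_{i=0}^{n} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau). \tag{2.8}
$$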


In words, if the sum of the first $n + 1$ waiting times is smaller than $\tau$,
then it follows that more than $n$ events have occurred by time $\tau$. The
equal sign in (2.7) implies that both states of knowledge have the same truth-value,
that is, are equivalent.

Likewise, if the sum of the first $n + 1$ waiting times exceeds $\tau$, then it follows
that we have yet to observe more than $n$ events by time $\tau$. The equal sign
in (2.8) implies that both states of knowledge have the same truth-value, that is,
are equivalent.
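The equivalence between the cumulative distribution of a sum of $n + 1$ Exponential waiting times and the Poisson probability of more than $n$ events is easy to confirm numerically. The sketch below integrates the density $\lambda (\lambda q)^n / n! \, \exp(-\lambda q)$ up to $\tau$ with a midpoint rule and compares the result with one minus the Poisson sum. Function names and the number of integration steps are illustrative choices.

```python
import math

def erlang_cdf(tau, n, lam, steps=20000):
    """Integral over [0, tau] of lam * (lam*q)**n / n! * exp(-lam*q), midpoint rule."""
    h = tau / steps
    total = 0.0
    for s in range(steps):
        q = (s + 0.5) * h
        log_density = (math.log(lam) + n * math.log(lam * q)
                       - math.lgamma(n + 1) - lam * q)
        total += math.exp(log_density) * h
    return total

def poisson_tail(tau, n, lam):
    """Probability of more than n events in tau: one minus the Poisson sum to n."""
    x = lam * tau
    term, head = math.exp(-x), 0.0
    for i in range(n + 1):
        head += term              # adds (x**i / i!) * exp(-x)
        term *= x / (i + 1)
    return 1.0 - head
```

For any choice of $n$, $\lambda$, and $\tau$ the two functions should agree up to the integration error of the midpoint rule.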

The product and sum rules of Bayesian probability theory are derived by
way of consistency constraints, where consistency is operationalized as follows. If
there are two different routes that lead us to the self-same proposition, then
these routes should result in the same probability assignment. This, then, explains
the equivalences (2.7) and (2.8): consistency demands it, [6].

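The corollary of the equivalence relation (2.8) is, see (1.2),

$$
\begin{aligned}
\frac{(\lambda\tau)^n}{n!} \exp(-\lambda\tau) &= \sum_{i=0}^{n} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) - \sum_{i=0}^{n-1} \frac{(\lambda\tau)^i}{i!} \exp(-\lambda\tau) \\
&= \int_0^{\tau} \lambda \frac{(\lambda q)^{n-1}}{(n-1)!} \exp(-\lambda q) \, dq - \int_0^{\tau} \lambda \frac{(\lambda q)^n}{n!} \exp(-\lambda q) \, dq \\
&= \int_0^{\tau} \lambda \frac{(\lambda q)^{n-1}}{(n-1)!} \exp(-\lambda q) \left( 1 - \frac{\lambda q}{n} \right) dq,
\end{aligned} \tag{2.9}
$$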


where $n > 1$. It may be easily checked that the corollary equivalence (2.9) does
indeed hold.

Now, we may compute the cumulants of the distribution of $q$ either by
way of the evaluation of the moments of the probability distribution (2.4) or by way
of the evaluation of the moments of the stochastic sum (2.1).

The cumulants of a given Exponential waiting time are given as:

$$
\mu = \frac{1}{\lambda}, \qquad \sigma = \frac{1}{\lambda}, \qquad \gamma = 2, \qquad \kappa = 9. \tag{2.10}
$$


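So, the cumulants of the sum of $n + 1$ Exponential waiting times, (2.1), are given as:

$$
\mu_{n+1} = \frac{n+1}{\lambda}, \qquad \sigma_{n+1} = \frac{\sqrt{n+1}}{\lambda}, \qquad \gamma_{n+1} = \frac{2}{\sqrt{n+1}}, \qquad \kappa_{n+1} = \frac{6}{n+1} + 3. \tag{2.11}
$$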


Substituting the cumulants (2.11) in the MaxEnt approximative distribution,²
with the integral limits

$$
a = \begin{cases} 0, & \text{if } \mu - 6\sigma < 0, \\ \mu - 6\sigma, & \text{else,} \end{cases} \qquad b = \mu + 6\sigma, \tag{2.12}
$$

for the integral function (1.11), and going through the motions, we obtain for, say,
$\lambda = 1$ the following MaxEnt distribution:

2 See Section 1.3.



Figure 1. MaxEnt distribution of sum of n + 1 Exponential waiting times 



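Figure 2. Analytical distribution of sum of n + 1 Exponential
waiting times


By way of the equivalency (2.9), the road is now opened to evaluate the first
cumulants of (1.4). This will allow us to construct the MaxEnt approximation, as
discussed in Section 1.3, of the Beta-Like distribution of a Poisson-Like process,
where the event generating mechanism follows a Weibull instead of an Exponential
distribution.

3. Making Use of the Equivalence Relation


Let $q$ be the sum of $n + 1$ waiting times, which are generated by independent
Weibull processes:

$$
q = t_1 + \cdots + t_{n+1}, \qquad t_i \sim \mathrm{Weibull}(k, \lambda), \tag{3.1}
$$

where the Weibull distribution is given as

$$
p(t \mid k, \lambda) = k\lambda (\lambda t)^{k-1} \exp\left[ -(\lambda t)^k \right]. \tag{3.2}
$$

The cumulants of a given Weibull waiting time are given as, (3.2):

$$
\begin{aligned}
\mu &= \int_0^{\infty} t \, p(t \mid k, \lambda) \, dt, \\
\sigma &= \left[ \int_0^{\infty} (t - \mu)^2 \, p(t \mid k, \lambda) \, dt \right]^{1/2}, \\
\gamma &= \frac{1}{\sigma^3} \int_0^{\infty} (t - \mu)^3 \, p(t \mid k, \lambda) \, dt, \\
\kappa &= \frac{1}{\sigma^4} \int_0^{\infty} (t - \mu)^4 \, p(t \mid k, \lambda) \, dt,
\end{aligned} \tag{3.3}
$$

which evaluates to

$$
\begin{aligned}
\mu &= \frac{\Gamma\left(\frac{1}{k}\right)}{k\lambda}, \\
\sigma &= \frac{\sqrt{\Gamma\left(\frac{k+2}{k}\right) - \Gamma\left(\frac{k+1}{k}\right)^2}}{\lambda}, \\
\gamma &= \frac{3k^2\, \Gamma\left(\frac{3}{k}\right) - 6k\, \Gamma\left(\frac{2}{k}\right) \Gamma\left(\frac{1}{k}\right) + 2\, \Gamma\left(\frac{1}{k}\right)^3}{k^3 \left[ \Gamma\left(\frac{k+2}{k}\right) - \Gamma\left(\frac{k+1}{k}\right)^2 \right]^{3/2}}, \\
\kappa &= \frac{4k^3\, \Gamma\left(\frac{4}{k}\right) - 12k^2\, \Gamma\left(\frac{3}{k}\right) \Gamma\left(\frac{1}{k}\right) + 12k\, \Gamma\left(\frac{2}{k}\right) \Gamma\left(\frac{1}{k}\right)^2 - 3\, \Gamma\left(\frac{1}{k}\right)^4}{k^4 \left[ \Gamma\left(\frac{k+2}{k}\right) - \Gamma\left(\frac{k+1}{k}\right)^2 \right]^{2}}.
\end{aligned} \tag{3.4}
$$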


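So, the cumulants of the sum of $n + 1$ Weibull waiting times are given as, (3.4):

$$
\mu_{n+1} = (n+1)\mu, \qquad \sigma_{n+1} = \sqrt{n+1}\, \sigma, \qquad \gamma_{n+1} = \frac{\gamma}{\sqrt{n+1}}, \qquad \kappa_{n+1} = \frac{\kappa - 3}{n+1} + 3. \tag{3.5}
$$

Likewise, the cumulants of the sum of $n$ Weibull waiting times, (3.1), are given as, (3.4):

$$
\mu_n = n\mu, \qquad \sigma_n = \sqrt{n}\, \sigma, \qquad \gamma_n = \frac{\gamma}{\sqrt{n}}, \qquad \kappa_n = \frac{\kappa - 3}{n} + 3. \tag{3.6}
$$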
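The Weibull cumulants and their scaling for sums of independent waiting times can be computed directly from the raw moments $E[t^j] = \Gamma(1 + j/k) / \lambda^j$ of the density $k\lambda(\lambda t)^{k-1}\exp[-(\lambda t)^k]$. The following sketch uses the standard results for sums of $N$ independent, identically distributed terms: the mean scales with $N$, the standard deviation with $\sqrt{N}$, the skewness with $1/\sqrt{N}$, and the excess kurtosis $\kappa - 3$ with $1/N$. Function names are illustrative.

```python
import math

def weibull_cumulants(k, lam):
    """Mean, standard deviation, skewness and kurtosis of a Weibull(k, lam) time."""
    m1, m2, m3, m4 = (math.gamma(1.0 + j / k) / lam ** j for j in range(1, 5))
    var = m2 - m1 ** 2
    sd = math.sqrt(var)
    skew = (m3 - 3 * m2 * m1 + 2 * m1 ** 3) / sd ** 3
    kurt = (m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4) / var ** 2
    return m1, sd, skew, kurt

def sum_cumulants(count, mu, sd, skew, kurt):
    """Cumulants of the sum of `count` independent copies of one waiting time."""
    return (count * mu,
            math.sqrt(count) * sd,
            skew / math.sqrt(count),
            (kurt - 3.0) / count + 3.0)
```

Setting $k = 1$ recovers the Exponential values $\mu = \sigma = 1/\lambda$, $\gamma = 2$, and $\kappa = 9$, which gives a quick consistency check between the Weibull and Exponential cases.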


The cumulants (3.5) and (3.6) may be used to construct the MaxEnt
approximations of, respectively, the distributions of $n + 1$ and $n$ Weibull waiting times.
These MaxEnt distributions may then be substituted into (2.9), in order to get an
approximation of the intractable Poisson-Like probability model (1.4).

In order to construct the Beta-Like distribution of the Poisson-Like probability
model (1.4), the MaxEnt distributions that take as their inputs the cumulants in
(3.5) and (3.6), which are functions of the unknown parameters $k$ and $\lambda$, (3.4), have
to be weighted by the Weibull posterior (3.8), and summed.

For example, if we partition the 6-sigma region of the $(k, \lambda)$-parameter space into an $n$-by-$n$ grid,
then we substitute the center coordinates of the $(k, \lambda)$ squares in the cumulants
(3.5) and (3.6), construct the resulting $n \times n = n^2$ MaxEnt distributions, and weight them
with the probability volumes of the corresponding squares. We then sum these
weighted MaxEnt distributions, which will leave us with an approximation of the
highly intractable Beta-Like distribution of the Poisson-Like probability model (1.4).

Likewise, if we wish to find the Beta-Like distribution of the Poisson-Like
equivalent of (2.1), then by way of (2.4), (2.7), and (3.5), we may construct its
approximative distribution with the above described procedure of weighting MaxEnt
distributions.





















CHAPTER 4 


Bayesian Model Selection 


In Bayesian statistics there are four entities of interest: the prior, the likelihood,
the posterior, and the evidence. Now, anyone somewhat familiar with Bayesian
statistics probably knows about the first three of these entities, since these are
needed for Bayesian parameter estimation. However, the fourth entity, the evidence,
essential for Bayesian model selection, is less well known.

This is unfortunate, because even if the posterior represents the optimal
parameter estimation procedure, if the model employed is inappropriate, then the
optimality of the parameter estimation procedure will not make the underlying
model less inappropriate. And we quote Skilling:

I know of no other discipline in which half of the principal equation 
is so widely ignored, and it should not be ignored here either. I 
could (and often do) argue that the evidence 

$$
Z = p(D) = \int p(D, \lambda) \, d\lambda
$$

is even more important than the posterior

$$
p(\lambda \mid D) = \frac{p(\lambda, D)}{p(D)} = \frac{p(\lambda, D)}{Z}
$$

on the grounds that algebraically it has to be evaluated first, and
logically there’s no need to proceed to the posterior if the evidence 
is unacceptably weaker than that from some other candidate. So 
it’s the posterior that is subordinate to the evidence and definitely 
not the other way around. I myself think of “Bayesian inference” 
as the generation of the evidence, with the posterior following if 
needed. Evidence is primary. 


Now, the reason that we have gotten as far as we have, algebraically speaking,
without introducing the concept of the evidence, is because we have made use of
the fact that the prior times the likelihood is proportional to the posterior:

$$
\pi(\lambda)\, L(\lambda) \propto p(\lambda \mid D), \tag{0.7}
$$

where $\pi(\lambda)$ is proportional to the prior $p(\lambda)$ and $L(\lambda)$ is proportional to $p(D \mid \lambda)$,
the probability of the data given the parameter $\lambda$.

Seeing that any probability distribution should integrate to one, we may properly
normalize (0.7) by way of the identity:

$$
p(\lambda \mid D) = \frac{\pi(\lambda)\, L(\lambda)}{\int \pi(\lambda)\, L(\lambda) \, d\lambda}. \tag{0.8}
$$

Note that the normalizing constant

$$
C = \int \pi(\lambda)\, L(\lambda) \, d\lambda, \tag{0.9}
$$

is only equal to the evidence

$$
Z = \int p(\lambda)\, p(D \mid \lambda) \, d\lambda, \tag{0.10}
$$

if $\pi$ and $L$ are both properly normalized, where the former is normalized over
the unknown parameter $\lambda$ and the latter over the data $D$.

Note that until now, by way of the use of the proportionality sign, we have used
the Bayesian shorthand (0.7) to present our posteriors. In what follows, we will
compute, for demonstrative purposes, the evidences for the models in which the
generating failure mechanisms are Exponential and Weibull, respectively. But first
we will give a simple outline of the procedure of Bayesian model selection.

1. Bayesian Model Selection 

Let $p(\lambda \mid I)$ be the prior of some parameter $\lambda$, conditional on the prior background
information $I$. Let $p(D \mid \lambda, M)$ be the probability of the data $D$, conditional on a
given parameter $\lambda$ and the particular likelihood model $M$ which was invoked. Let
$p(\lambda \mid D, M, I)$ be the posterior distribution of $\lambda$, conditional on the data $D$, the
likelihood model $M$, and the prior background information $I$.

We then have, by way of the product rule, or, equivalently, Bayes' theorem,
that

$$
p(\lambda \mid D, M, I) = \frac{p(\lambda \mid I)\, p(D \mid \lambda, M)}{p(D \mid M, I)}, \tag{1.1}
$$

where

$$
p(D \mid M, I) = \int p(\lambda \mid I)\, p(D \mid \lambda, M) \, d\lambda, \tag{1.2}
$$

is the marginalized likelihood of the model $M$ and the background information $I$,
also known as the evidence of $M$ and $I$.

Note that the evidence (1.2) judges both the likelihood model, $M$, as
well as the prior model, $I$, that went into the construction of the posterior. Now,
this could be seen as a weakness of Bayesian model selection,¹ since in general all
the ingenuity goes into the construction of a sophisticated likelihood model. So,
why bother with some uninformative prior, when we only want to compare different
likelihood models?

1 As was once suggested to the first author, during a colloquium on Bayesian model selection.

There are two reasons why it is a good thing that Bayesian model selection takes 
into account both the prior and the likelihood, and does not neglect the former. 

Firstly, there are instances, for example in image reconstruction [8], where all
the artfulness goes into the construction of the prior, and there we have that it is
the likelihood which is trivial. So, it is precisely because Bayesian probability
theory puts the prior and likelihood models automatically on an equal footing that
Bayesian model selection can differentiate between the different prior models of
image reconstruction inference problems, without breaking down.

Secondly, by judging the prior the evidence automatically guards us against the
danger of over-parametrization, that is, choosing such a complex likelihood model,
in terms of the number of parameters employed, that we fit the noise in the data as
a structural part of the data.

Say we have $m$ different likelihood models, $M_1, \ldots, M_m$, to choose from and
one class of, say, uninformative prior background models, $I$. Then we may compute
$m$ different evidence values $p(D \mid M_j, I)$, for $j = 1, \ldots, m$.

Let $p(M_j)$ be the prior probability distribution of the likelihood models $M_j$, and
let $p(M_j \mid D, I)$ be the posterior probability distribution of these models, conditional
on the data and the general prior background information $I$. Then we have that

$$
p(M_j \mid D, I) = \frac{p(M_j)\, p(D \mid M_j, I)}{\sum_i p(M_i)\, p(D \mid M_i, I)}, \tag{1.3}
$$

for $j = 1, \ldots, m$.

Note that if we have that $p(M_j) = p(M_k)$, for $j \neq k$, then we have that (1.3)
reduces to

$$
p(M_j \mid D, I) = \frac{p(D \mid M_j, I)}{\sum_i p(D \mid M_i, I)}. \tag{1.4}
$$

Stated differently, if we assign equal prior probabilities to our different likelihood
models, then the posterior probabilities of these likelihood models reduce to their
normalized evidence values. This, then, is why the likelihood models may be ranked
by their respective evidence values, [7].
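In practice evidence values are often extremely small numbers, so the normalization over models is best carried out on a logarithmic scale. The sketch below computes posterior model probabilities from log evidences $\log p(D \mid M_j, I)$ and optional model priors $p(M_j)$, subtracting the largest term before exponentiating. The function name and interface are illustrative, and the use of log evidences is an implementation choice rather than something prescribed by the text.

```python
import math

def model_posteriors(log_evidences, priors=None):
    """Posterior model probabilities from log evidences and model priors."""
    count = len(log_evidences)
    if priors is None:
        # Equal prior probabilities: the posteriors are the normalized evidences.
        priors = [1.0 / count] * count
    logs = [math.log(p) + le for p, le in zip(priors, log_evidences)]
    top = max(logs)                       # guards against underflow
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, two models with evidences in the ratio 1 to 3 and equal priors receive posterior probabilities 0.25 and 0.75.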
2. Computing Evidence Values

For the Exponential model, which we designate $M_1$, we have that the likelihood
model is given as, (2.4),

$$
\begin{aligned}
p(D \mid \lambda, M_1) &= \prod_{i=1}^{r} \lambda \exp(-\lambda x_i) \, dx_i \prod_{j=1}^{n-r} \exp(-\lambda y_j) \\
&= \lambda^r \exp\left[ -\lambda \left( \sum_{i=1}^{r} x_i + \sum_{j=1}^{n-r} y_j \right) \right] \prod_{i=1}^{r} dx_i.
\end{aligned} \tag{2.1}
$$


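As a prior we take the properly normalized uninformative prior:

$$
p(\lambda \mid I) = \frac{C_\lambda}{\lambda}, \tag{2.2}
$$

where $C_\lambda$ is the normalizing constant, with

$$
C_\lambda^{-1} = \int_{a_\lambda}^{b_\lambda} \frac{d\lambda}{\lambda} = \log b_\lambda - \log a_\lambda, \tag{2.3}
$$

where $a_\lambda$ and $b_\lambda$ define the prior range of possible values of $\lambda$.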

Cogent prior information regarding $a_\lambda$ is that value of $\lambda$ for which, for some
given time interval $\tau$, the expectation value $\lambda\tau$ would become so small that too few
failures would be witnessed in said time period. Cogent prior information regarding
$b_\lambda$ is that value of $\lambda$ for which, for some given time interval $\tau$, the expectation value
$\lambda\tau$ would become so large that too many failures would be witnessed in said time
period.


Multiplying the properly normalized likelihood (2.1) with the properly
normalized prior (2.2), we obtain the properly normalized bivariate distribution
of both the parameter and the data:

$$
p(\lambda, D \mid M_1, I) = C_\lambda \lambda^{r-1} \exp\left[ -\lambda \left( \sum_{i=1}^{r} x_i + \sum_{j=1}^{n-r} y_j \right) \right] \prod_{i=1}^{r} dx_i, \tag{2.4}
$$

which, being properly normalized, will allow us to evaluate the evidence of $M_1$.


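By way of (1.2) and (2.4), we then evaluate the evidence for the Exponential
model as

$$
\begin{aligned}
p(D \mid M_1, I) &= \int_{a_\lambda}^{b_\lambda} p(\lambda, D \mid M_1, I) \, d\lambda \\
&= C_\lambda \prod_i dx_i \int_{a_\lambda}^{b_\lambda} \lambda^{r-1} \exp\left[ -\lambda \left( \sum_i x_i + \sum_j y_j \right) \right] d\lambda
\end{aligned} \tag{2.5}
$$

$$
\approx C_\lambda \frac{(r-1)!}{\left( \sum_i x_i + \sum_j y_j \right)^r} \prod_i dx_i. \tag{2.6}
$$

For the Weibull model, which we designate $M_2$, we have that the likelihood
model is given as, (3.4),

$$
\begin{aligned}
p(D \mid k, \lambda, M_2) &= \prod_{i=1}^{r} k\lambda (\lambda x_i)^{k-1} \exp\left[ -(\lambda x_i)^k \right] dx_i \prod_{j=1}^{n-r} \exp\left[ -(\lambda y_j)^k \right] \\
&= \lambda^{rk} k^r \exp\left[ -\lambda^k \left( \sum_{i=1}^{r} x_i^k + \sum_{j=1}^{n-r} y_j^k \right) \right] \prod_{i=1}^{r} x_i^{k-1} \, dx_i.
\end{aligned} \tag{2.7}
$$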


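As a prior we take the properly normalized uninformative prior:

$$
p(k, \lambda \mid I) = p(k \mid I)\, p(\lambda \mid I) = \frac{C_k}{k} \frac{C_\lambda}{\lambda}, \tag{2.8}
$$

where $C_\lambda$ is as in (2.3) and $C_k$ is the normalizing constant, with

$$
C_k^{-1} = \int_{a_k}^{b_k} \frac{dk}{k} = \log b_k - \log a_k, \tag{2.9}
$$

where $a_k$ and $b_k$ define the prior range of possible values of $k$.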


Multiplying the properly normalized likelihood (2.7) with the properly
normalized prior (2.8), we obtain the properly normalized bivariate distribution
of both the parameters and the data:

$$
p(k, \lambda, D \mid M_2, I) = C_\lambda C_k \lambda^{rk-1} k^{r-1} \exp\left[ -\lambda^k \left( \sum_{i=1}^{r} x_i^k + \sum_{j=1}^{n-r} y_j^k \right) \right] \prod_{i=1}^{r} x_i^{k-1} \, dx_i, \tag{2.10}
$$

which, being properly normalized, will allow us to evaluate the evidence of $M_2$.

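By way of (1.2) and (2.10), we then evaluate the evidence for the Weibull model
as

$$
\begin{aligned}
p(D \mid M_2, I) &= \int_{a_k}^{b_k} \int_{a_\lambda}^{b_\lambda} p(k, \lambda, D \mid M_2, I) \, d\lambda \, dk \\
&= C_\lambda C_k \prod_i dx_i \int_{a_k}^{b_k} \int_{a_\lambda}^{b_\lambda} \lambda^{rk-1} k^{r-1} \exp\left[ -\lambda^k \left( \sum_i x_i^k + \sum_j y_j^k \right) \right] \prod_i x_i^{k-1} \, d\lambda \, dk \\
&\approx C_\lambda C_k (r-1)! \prod_i dx_i \int_{a_k}^{b_k} \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk,
\end{aligned} \tag{2.11}
$$

where the integral over the unknown shape parameter $k$ must be evaluated numerically.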

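Say we do not have any prior preference for either model $M_1$ or $M_2$. Then,
letting the data speak for itself, we assign equal prior probabilities to both likelihood
models. We then, by way of (1.4), (2.5), and (2.11), may compute the posterior
probability of $M_1$:

$$
\begin{aligned}
p(M_1 \mid D, I) &= \frac{p(D \mid M_1, I)}{p(D \mid M_1, I) + p(D \mid M_2, I)} \\
&\approx \frac{\left[ C_\lambda (r-1)! \prod_i dx_i \right] \left( \sum_i x_i + \sum_j y_j \right)^{-r}}{\left[ C_\lambda (r-1)! \prod_i dx_i \right] \left( \sum_i x_i + \sum_j y_j \right)^{-r} + \left[ C_\lambda (r-1)! \prod_i dx_i \right] C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk} \\
&= \frac{\left( \sum_i x_i + \sum_j y_j \right)^{-r}}{\left( \sum_i x_i + \sum_j y_j \right)^{-r} + C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk},
\end{aligned} \tag{2.12}
$$

where we have cancelled out all the terms shared by both evidence values (2.5) and
(2.11).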

Furthermore, seeing that the normalizing constant $C_\lambda$, (2.3), cancels out, we
may let (2.2) be an improper prior and let the constants of integration go to $a_\lambda \to 0$
and $b_\lambda \to \infty$. This allows us to replace the 'approximately-equal-to' signs in (2.5)
and (2.11) with equality signs, which propagates through in (2.12).

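By way of (1.4), (2.5), and (2.11), we may also compute the posterior probability
of $M_2$:

$$
\begin{aligned}
p(M_2 \mid D, I) &= \frac{p(D \mid M_2, I)}{p(D \mid M_1, I) + p(D \mid M_2, I)} \\
&= \frac{\left[ C_\lambda (r-1)! \prod_i dx_i \right] C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk}{\left[ C_\lambda (r-1)! \prod_i dx_i \right] \left( \sum_i x_i + \sum_j y_j \right)^{-r} + \left[ C_\lambda (r-1)! \prod_i dx_i \right] C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk} \\
&= \frac{C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk}{\left( \sum_i x_i + \sum_j y_j \right)^{-r} + C_k \int \frac{k^{r-2} \prod_i x_i^{k-1}}{\left( \sum_i x_i^k + \sum_j y_j^k \right)^r} \, dk},
\end{aligned} \tag{2.13}
$$

where we have again cancelled out all the terms shared by both evidence values
(2.5) and (2.11).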
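The posterior model probabilities require only a one-dimensional numerical integral over the shape parameter $k$. The following sketch works in the limit where the prior on $\lambda$ is improper, so that the Exponential evidence reduces (up to shared factors) to $(\sum_i x_i + \sum_j y_j)^{-r}$ and the Weibull evidence to $C_k \int k^{r-2} \prod_i x_i^{k-1} / (\sum_i x_i^k + \sum_j y_j^k)^r \, dk$. Integrating on a uniform grid in $\log k$ and working with logarithms throughout are implementation choices made here for numerical stability; the function names and the prior bounds `k_low` and `k_high` are illustrative.

```python
import math

def log_power_sum(times, k):
    """Log of sum(t**k for t in times), computed without overflow."""
    logs = [k * math.log(t) for t in times]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def log_weibull_integrand(k, failures, survivals):
    """Log of k**(r-2) * prod(x**(k-1)) / (sum x**k + sum y**k)**r."""
    r = len(failures)
    return ((r - 2) * math.log(k)
            + (k - 1) * sum(math.log(x) for x in failures)
            - r * log_power_sum(failures + survivals, k))

def posterior_exponential(failures, survivals, k_low, k_high, steps=4000):
    """Posterior probability of the Exponential model under equal model priors."""
    r = len(failures)
    log_e1 = -r * math.log(sum(failures) + sum(survivals))
    # With u = log k we have dk = k du, and C_k * du = 1 / steps on a uniform
    # grid, so the Weibull evidence is the grid average of k * integrand.
    h = (math.log(k_high) - math.log(k_low)) / steps
    us = [math.log(k_low) + (s + 0.5) * h for s in range(steps)]
    logs = [log_weibull_integrand(math.exp(u), failures, survivals) + u for u in us]
    top = max(logs)
    log_e2 = top + math.log(sum(math.exp(v - top) for v in logs) / steps)
    return 1.0 / (1.0 + math.exp(log_e2 - log_e1))
```

Here `failures` holds the observed failure times $x_i$ and `survivals` the times without failure $y_j$, both as lists. The posterior probability of the Weibull model is one minus the returned value.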

3. A Word of Caution 

Note that for $\lambda$ we, eventually, used an improper uninformative prior. We did
so when we let the constants of integration in (2.3) go to $a_\lambda \to 0$ and $b_\lambda \to \infty$, thus
giving us a normalizing constant of

$$
C_\lambda^{-1} = \int_0^{\infty} \frac{d\lambda}{\lambda} = \infty, \tag{3.1}
$$

or, equivalently,

$$
C_\lambda = \frac{1}{\infty} = 0. \tag{3.2}
$$

The only reason we had the freedom to do so was that the constant $C_\lambda$
cancelled out in (2.12) and (2.13), thus removing the inverse infinities which
resulted from the improper, that is, diverging, prior of $\lambda$. Now, had we done the
same for the constants of integration in (2.9), then we would have obtained an inverse
infinity $C_k = 0$, which would not have cancelled out in (2.12) and (2.13), thus giving
us the posterior model probabilities

$$
p(M_1 \mid D, I) = \frac{\left( \sum_i x_i + \sum_j y_j \right)^{-r}}{\left( \sum_i x_i + \sum_j y_j \right)^{-r}} = 1, \tag{3.3}
$$

and

$$
p(M_2 \mid D, I) = \frac{0}{\left( \sum_i x_i + \sum_j y_j \right)^{-r}} = 0. \tag{3.4}
$$


We see in (3.3) and (3.4) how Bayesian model selection may punish us for
non-parsimonious priors when the normalizing constants of these priors do not
cancel out in the posterior of the competing likelihood models. To the uninitiated
this may seem a bother. But we Bayesians would not have it any other way,
because it is this penalizing mechanism of Bayesian model selection, which is just
a straightforward consequence of the product and sum rules, which automatically protects us
from the dangers of over-fitting.

For example, these authors had to choose among regression models² having four
up to a thousand possible regression coefficients to model noisy data. By deriving a
parsimonious prior for the regression coefficients, [3], we were able to let the data
do the talking. We found that Bayesian probability theory picked the likelihood
model having only sixty-four parameters. Those models having more parameters,
though having a better likelihood fit because of the greater number of parameters,
were penalized for their prior probability volume and, as a consequence, noise was
minimally fitted as part of the structure.

The take-home message from all this is the following: In the computing of the
evidences, (1.2),

(1) improper priors should only be used if their normalizing constants will
cancel out in (1.3), and

(2) priors whose normalizing constants do not cancel out should be as
parsimonious as possible.


2 Those regression models being C-splines models, [1].